
Prerequisites

On Debian/Ubuntu:
apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
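The build commands below are run from the repository root. A minimal sketch, assuming the sources live at github.com/ikawrakow/ik_llama.cpp:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp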

CMake flags

Pass these flags to the initial cmake -B build invocation (the configure step), not to cmake --build.
| Flag | Default | Description |
| --- | --- | --- |
| GGML_NATIVE | OFF | Optimize for the host CPU (-march=native). Turn off when cross-compiling. |
| GGML_CUDA | OFF | Build with CUDA support. Requires the NVIDIA CUDA Toolkit. Defaults to native CUDA architecture detection. |
| CMAKE_CUDA_ARCHITECTURES | auto | Target a specific GPU compute capability, e.g. 86 for RTX 30-series. |
| GGML_RPC | OFF | Build the RPC backend for distributed inference across machines. |
| GGML_IQK_FA_ALL_QUANTS | OFF | Enable all KV cache quantization types for Flash Attention (beyond the default f16, q8_0, q6_0, and bf16). |
| GGML_NCCL | ON | Enable NCCL for multi-GPU communication. Set to OFF to disable. |
| LLAMA_SERVER_SQLITE3 | OFF | Build SQLite3 support into llama-server (required for the mikupad web UI). |
CPU build example
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
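The CPU example above bakes in host-specific optimizations. For a binary meant to run on a different machine, the table above says to turn GGML_NATIVE off instead; a portable-build sketch:
cmake -B build -DGGML_NATIVE=OFF
cmake --build build --config Release -j$(nproc)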
CUDA build example
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j$(nproc)
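Configure-time flags from the table can be combined in a single invocation. A sketch that also enables the RPC backend and the extra Flash Attention KV cache quantization types (flag names as listed above; adjust to your hardware):
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_RPC=ON -DGGML_IQK_FA_ALL_QUANTS=ON
cmake --build build --config Release -j$(nproc)
At run time the extra cache types are selected with the server's KV cache type options (-ctk/-ctv in upstream llama.cpp), assuming this fork keeps those option names.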

Environment variables

Set these in the shell before invoking llama-server or any other tool.
| Variable | Description |
| --- | --- |
| CUDA_VISIBLE_DEVICES | Restrict which GPUs are visible. Example: CUDA_VISIBLE_DEVICES=0,2 uses the first and third GPU only. |
| GGML_CUDA_ENABLE_UNIFIED_MEMORY | Set to 1 to enable CUDA Unified Memory, allowing the GPU to access host RAM when VRAM is exhausted. Useful for large models on systems with limited VRAM. |

For example:
CUDA_VISIBLE_DEVICES=0,2 llama-server --model /models/model.gguf -ngl 999
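The unified-memory variable is set the same way. A sketch that lets a model spill from VRAM into host RAM, using the same placeholder model path as the example above:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server --model /models/model.gguf -ngl 999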
The only fully supported compute backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained.